Introduction

So... you've got some interesting property of the network uncovered. Is it real? How can you trust that what you've found didn't arise by random chance?

One useful way of thinking by using generative models of random graphs. By "generative" and "random", we mean that the graph was generated using some statistical random model underlying it. This is essentially in the "Bayesian" tradition of modelling.

The alternative is a non-parametric way of thinking. This is where we don't assume some model that generated the data, and instead performs randomization on the data on hand to calculate statistics. Here, we assume that our randomization procedure is an appropriate random model for the network.

Statistics Primer

Discrete Distributions:

  • Bernoulli: probability p of success in 1 try (e.g. one flip of coin).
  • Binomial: probability p of success given n number of tries. n * bernoulli trials follows binomial distribution.
  • Poisson: processes with a per-unit rate.

Continuous Distributions:

  • Uniform: equal probability over the range of probable values. Can also be made discrete.
  • Normal: everyone's favourite.

We can also make up an arbitrary distribution of our own. :-)

Statistical Test of Significance of Busy Routes

If we think back to the graph made in the previous notebook, what kind of process might one imagine that gave rise to the data?

brainstorm session...

Group work.

Discrete uniform distribution over all routes

Assumptions of this model:

  1. Riders choose start station with uniform probability.
  2. They then choose a destination station with uniform probability.
  3. They ride it and the data are recorded.

Under this model, what is the distribution of number of rides taken throughout the ? We can probably solve this analytically, but since we have ultra-cheap computational capabilities at our fingertips, might as well do it the brute-force way.


In [ ]:
import networkx as nx

G = nx.read_gpickle('datasets/divvy_2013/divvy_graph.pkl')
total_trips = sum([d['count'] for _,_,d in G.edges(data=True)])
print(total_trips)

Exercise

Calculate the expected number of trips between any two stations, assuming self-loops are allowed.


In [ ]:

Exercise

Can you plot a histogram of the number of trips taken on each route?


In [ ]:

Exercise

Can you figure out the 2.5th, 97.5th, and 100th percentile of this distribution?

Hint: You may wish to use numpy's percentile function.


In [ ]:
import numpy as np

Computing the interval between the 2.5th to the 97.5th percentile effectively gives you a centered 95% mass of the distribution. Knowing this range can be quite useful in other data analysis cases.

Live Class Exercise

Together, we will implement a random model for the number of trips that are taken between two stations.

In this procedure, we will copy the nodes graph G into a new graph G_random. This can be done by:

G_random = nx.Graph()
G_random.add_nodes_from(G.nodes(data=True))

Following that, we will manually map-reduce this computation.


In [ ]:
# Grab all the nodes from G.
G_random = nx.Graph()
G_random.add_nodes_from(G.nodes(data=True))
G_random.nodes(data=True)

In [ ]:
# Implement the procedure by brute force. You will find it takes a bit of time.......... but because we have so many people 
# in the class, we can brute force parallelize this :-).
from random import sample, choice
import math

n = 40 # the total number of students with Python 3.4 installed.

for i in range(math.ceil(total_trips/n)):
    if i % 1000 == 0:
        print(i)
    n1 = choice(G_random.nodes())
    n2 = choice(G_random.nodes()) #note: n1 and n2 might well be the same nodes.
    if (n1, n2) not in G_random.edges():
        G_random.add_edge(n1, n2, count=0)
    else:
        G_random.edge[n1][n2]['count'] += 1

In [ ]:
your_id = 1 # change this number to the ID given to you.
nx.write_gpickle(G_random, 'datasets/divvy_2013/divvy_random{0}.pkl'.format(your_id))

In [ ]:
# Load the graphs with randomly re-distributed trip counts as a list of graphs.
import os

def is_pkl(filename):
    """
    Checks if a file is a pickled graph.
    """
    if filename.split('.')[-1] == 'pkl':
        return True
    else:
        return False
    
def is_divvy_random_graph(filename):
    """
    Checks if it's a Divvy graph.
    """
    if 'divvy_random' in filename:
        return True
    else:
        return False
    
# Load the data into a list of graphs.
divvy_random_graphs = []
for f in os.listdir('datasets/divvy_2013'):
    if is_pkl(f) and is_divvy_random_graph(f):
        g = nx.read_gpickle('datasets/divvy_2013/{0}'.format(f))
        graphs.append(g)

In [ ]:
# Plot the distribution of counts in a re-distributed set of trips.
import matplotlib.pyplot as plt
%matplotlib inline

G_random_master = nx.Graph()

for g in graphs:
    for sc, sk, d in g.edges(data=True):
        if (sc, sk) not in G_random_master.edges():
            G_random_master.add_edge(sc, sk, count=d['count'])
        if (sc, sk) in G_random_master.edges():
            G_random_master.edge[sc][sk]['count'] += d['count']

# Remove edges that have no counts
for sc, sk, d in G_random_master.edges(data=True):
    if d['count'] == 0:
        G_random_master.remove_edge(sc, sk)

# Plot the distribution of counts throughout the network
counts = [d['count'] for _,_,d in G_random_master.edges(data=True)]
plt.bar(Counter(counts).keys(), Counter(counts).values())
plt.show()

In [ ]:
nx.write_gpickle(G_random_master, 'datasets/divvy_2013/divvy_random_master.pkl')

Think about it...

Given the distribution of trip counts in the randomly re-distributed trips, what can we infer about the popularity of certain routes? Are they likely to have shown up in this random model?


In [ ]: